SPICE: Semantic Propositional Image Caption Evaluation
Authors
Abstract
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric, SPICE, defined over scene graphs. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?"
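The abstract states only that SPICE is defined over scene graphs. As a minimal sketch of that idea, assuming each caption has already been parsed into a set of scene-graph tuples (objects, object-attribute pairs, and object-relation-object triples), a score can be computed as the F1 of tuple overlap between candidate and reference. The names `tuple_f1` and `filter_attribute_tuples` are hypothetical, the parsing step is omitted, and matching here is exact string equality, which is a simplification and not necessarily how the published metric matches tuples.

```python
from typing import Iterable, Set, Tuple

# A scene-graph tuple: ("girl",) is an object, ("girl", "young") is an
# object-attribute pair, ("girl", "standing-on", "court") is a relation.
SceneTuple = Tuple[str, ...]


def tuple_f1(candidate: Iterable[SceneTuple],
             reference: Iterable[SceneTuple]) -> float:
    """F1 of exact tuple overlap between candidate and reference tuples."""
    cand: Set[SceneTuple] = set(candidate)
    ref: Set[SceneTuple] = set(reference)
    matched = len(cand & ref)
    if matched == 0:
        # Covers empty inputs and disjoint tuple sets.
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_attribute_tuples(tuples: Iterable[SceneTuple],
                            vocabulary: Set[str]) -> Set[SceneTuple]:
    """Keep only object-attribute pairs whose attribute is in `vocabulary`.

    Assumption: restricting the score to one tuple subset (e.g. color
    words) is one way to ask how well a captioner handles that concept.
    """
    return {t for t in tuples if len(t) == 2 and t[1] in vocabulary}


if __name__ == "__main__":
    candidate = [("girl",), ("court",), ("girl", "young")]
    reference = [("girl",), ("court",), ("racket",),
                 ("girl", "young"), ("girl", "standing-on", "court")]
    print(tuple_f1(candidate, reference))
```

In this toy example the candidate matches three of the five reference tuples with no spurious tuples, so precision is 1.0 and recall is 0.6. Scoring only the tuples returned by `filter_attribute_tuples` with a color vocabulary gives a per-concept score of the kind the last sentence of the abstract alludes to.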
Similar resources
Learning to Evaluate Image Captioning
Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlate...
Automatic Semantic Analysis of Television News Captions
Automatic indexing of image data is in strong demand. Utilizing accompanying natural language information is considered effective to accomplish the task. As a basis for semantic indexing, we propose an automatic television caption semantic analysis method, which analyzes semantic attributes of Japanese television news captions referring to suffixes. This is a basic pre-process required to enable a...
Exploring Nearest Neighbor Approaches for Image Captioning
We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the “consensus” of the set of candidate captions gathered from the nearest neighbor images. When ...
Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine
Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...
Applications of Scene Attributes
In this paper, we study the feasibility of scene attributes as the intermediate scene representation for automatic image captioning, tag prediction and semantic image search. We show that when used as features for these tasks, low-dimensional scene attributes can compete with or improve on state-of-the-art performance. In particular, we propose a new method of content-based image retrieval, whi...